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Abstract 

We are concerned with the issue of how to calculate the normalized maximum 
likelihood (NML) code-length. There is a problem that the normalization term of the 
NML code-length may diverge when it is continuous and unbounded and a straight- 
forward computation of it is highly expensive when the data domain is finite . In 
previous works it has been investigated how to calculate the NML code-length for 
specific types of distributions. We first propose a general method for computing the 
NML code-length for the exponential family. Then we specifically focus on Gaus- 
sian mixture model (GMM), and propose a new efficient method for computing the 
NML to them. We develop it by generalizing Rissanen's re-normalizing technique. 
Then we apply this method to the clustering issue, in which a clustering structure 
is modeled using a GMM, and the main task is to estimate the optimal number of 
clusters on the basis of the NML code-length. We demonstrate using artificial data 
sets the superiority of the NML-based clustering over other criteria such as AIC, BIC 
in terms of the data size required for high accuracy rate to be achieved. 
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1 Introduction 



1.1 Motivation and Previous Works 

This paper addresses the issue of how to calculate the normalized maximum likelihood 
(NML) code-length for a given sequence. Suppose that we are given an n tuple of m- 
dimensional data x n = (x 1; ■ ■ • , x n ) G X n , where each x* G X C M. m . We define the NML 
distribution /nml relative to a model class M. = {f(X n ; 9) : 9 G 6} (n = 1, 2, • ■ ■ ) by 

7nml(x , A1J — cDW) ' 

C(Af) = J f(x n ;9(x n ),M)dx n , 

where G is a parameter spece and 9 is a maximum likelihood estimator of 9 from x n . The 
NML code-length for x n relative to Ai is calculated as follows: 

-log/ N M L (x n ;M) = -logf(x n ;9(x n ,M))+logC(M), 

It is known from [8j that the NML code-length is optimal in the sense that it achieves 
the minimum of Shtarkov's minimax criterion [12]. The NML code- length is called the 
stochastic complexity [8J and has been employed as a criterion for statistical model selection 
on the basis of the minimum description length (MDL) principle pTJ| . However, there is 
a problem that the normalization term may diverge and a straightforward computation of 
the normalization term in the NML code-length is highly expensive. The purpose of this 
paper is twofold. One is to propose a method for efficient computing the NML code-length 
for the exponential family and Gaussian mixture models. The other is to demonstrate the 
validity of its applications to optimal clustering. 

Rissanen [8] derived a formula of an asymptotic approximation of the NML code-length: 

-logp(a^0(z n )) + |log^ + log J y/\I(e)\de + o(l), 

where 1(9) is the Fisher information matrix. Note that this formula takes an asymptotic 
form. A method for exactly computing the NML code-length has been desired. In the case 
where the data domain is discrete, there is a problem that the time for a straightforward 
computation of the normalization term is exponential in data size even for the simplest 
case where the class of distributions is that of mutinomial distributions. Kontkanen and 
Myllymaki proposed efficient algorithms for the NML code-length for multinomial distri- 
butions and Naive Bayes model [61 [7]. Meanwhile, in the case where the data domain is 
continuous and not bounded, there is a problem that the normalization term may diverge 
for, e.g., Gaussian distributions. Rissanen proposed a method for circumventing this prob- 
lem for linear regression models by making an elliptic constraint for the data domain so 
that the normalization term does not diverge [9]. Giurcaneanu et. al. proposed another 



2 



method using an rhomboid constraint [3J. Note that all of these works [HIE] considered 1- 
dimensional Gaussian distributions. Hirai and Yamanishi [3] applied Rissanen's technique 
to the computation of the NML code-length for multi-variate Gaussian distributions. 

We are specifically concerned with the applications of the NML code-length to clus- 
tering. A mixture model may be used as a probabilistic model of clustering where each 
mixture component corresponds to a cluster. The estimation of the mixture size is one 
of the most important issues in clustering. Kontkanen and Myllymaki [7] proposed an 
efficient algorithm for NML-based clustering with optimal choices of mixture size for the 
case where the data domain was discrete. Hirai and Yamanishi [5] proposed an algorithm 
for efficiently computing the NML code- length for Gaussian mixture models (GMM) for 
the case where the data domain was continuous. 

1.2 Significance of This Paper 

1) An extension of the computation of the NML code-length to the exponential family. 
We extend Hirai and Yamanishi's method [5] for computing the NML code-length for 
Gaussian distributions and GMMs to exponential family including Gamma distributions, 
logistic distributions, etc. Then we give a method for calculating the NML code-length in 
a general form. 

2) An improvement of the NML code-length for Gaussian distributions and GMMs 
using the renormalizing technique. We apply Rissanen's renormalizing technique [9] into 
Gaussian distributions and GMMs to derive new formulas for computing the NML code- 
lengths for them. Conventional formulas in [5] depend on the parameters by which the 
data domain is restricted. The new formulas are obtained by renormalizing the likelihood 
with respect to the parameters, and are improved in that they are less dependent on hyper- 
parameters than those in [5] . We call the resulting code-length the renormalized maximum 
likelihood code-length (RNML). Note that the RNML are different from Rissanen's original 
one [H] in that they are derived for the case where data is multi-dimensional while Rissanen 
considered a specific case where it was 1-dimensional. 

3) An empirical demonstration of the superiority of RNML over other criteria in the 
clustering scenario. We apply the RNML code-length to the clustering scenario in which a 
GMM is used as a model for clustering. In it we employ artificial data sets to empirically 
demonstrate the validity of RNML in the estimation of the number of clusters. We show 
that the number of clusters chosen by the RNML-based criterion converges significantly 
faster to the true one than those chosen by other criteria such as AIC, BIC, and the original 
NML. 
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2 NML Code-Length for Exponential Family 

In this section, we introduce a method of computing the NML code-length for the expo- 
nential family. 

2.1 Exponential Family 

Below we define the exponential family. 

Definition 1 The probability density function belonging to the exponential family takes 
the following form: 

f(X- 9) = h(X) exp {r,(9) T T(X) - A( V (9))} , (2) 

where 9 G M. D is a real-valued parameter vector (D is the number of parameters) and A(rj) 
is a normalization term. 

The joint distribution of data x n is given as follows: 

n 

/(x»; 9) = J] h(fr) exp (r^) T T(x 4 ) - A( V (9))} . 
i=i 

Then the maximum likelihood estimate (MLE): 9{x n ) satisfies: 

1 " 

i=l 

2.2 NML Code-Length for Exponential Family 

Below we consider how to calculate the normalization term: C(M) as in ([1]) for the expo- 
nential family. Suppose that for any data, the MLE of 9 from the data can analytically 
be obtained. It is known that for the exponential family, the MLE can be calculated as a 
function of sufficient statistics. Hence we may denote the MLE as follows: 

where the Q(x) is a certain function of x. 

Below we show how to calculate C(M) by circumventing the problem that it may 
diverge. The function to be integrated is expanded as follows: 

f(x n ;9) = /i(x"|(?(x")) xexpjn^e-^^x"))-^^)} 

D 

= #( X "i#(x"))xn^(x")i0). 

d=l 
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Here we denote rje = r)(9) and define the function if(x n |#(x™)) = f £(#(x n ) = 9) (S(-) is a 
delta function), and the gd(9d(x. n )\9) is the distribution of the MLE for the <i-th part of 
the parameter 9d- Notice here that 9d is not a component of 9 but rather a part of it- a 
collection of components. We assume here that parameter parts {9d} are independent with 
respect to d. We fix 9{x n ) = 9 and let 

D 

90)= U.9M9). 

d=l 

We can calculate the normalization term C(M) by integrating g{9) with respect to 9 over 
the restricted domain as follows: 

C(M) = [ g(9)d§, 

JY{a) 

where we restrict the domain for the integral to be Y(pt) where a is a parameter by which 
the integral 9 is specified. 

In summary, for the exponential family, the NML code-length can analytically be ob- 
tained provided that the following conditions are fulfilled: 

1. The MLE of 9 can be calculated analytically. 

2. The integral of g{9) with respect to 9 can analytically be obtained. 
2.3 Examples 

Below we give examples of calculation of the NML code-lengths for the exponential family. 
For the sake of simplicity, we focus on the normalization term C(M) as in (JTJ). 

2.3.1 Gamma Distributions 

Gamma distributions belong to the exponential family. The density function of x n for a 
Gamma distribution is defined as follows: 

n , 

i=l y ' 

where k is a shape parameter and 9 is a scale parameter. 

The MLE of 9 can analytically be obtained. We consider the case where k is known 
and fixed. The MLE of 9 is given by 9{x n ) = Y^i=i x i/{kn). Thus the joint distribution of 
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x n is given as follows: 

1 n ( 1 n 1 

= #(x n | fc, 0(x n )) • #(#(:r n ); k, 9), 

where 9 is distributed according to the Gamma distribution with a shape parameter kn 
and a scale parameter 9/{kn). Hence g(9(x n ); k,9) is calculated as follows: 

Fix 9(x n ) = 9 and let H(x n \ k,9(x n )) = 5(9(x n ) = 9). Then we have 

g(9;k) d ^g(9;k,9)= -l 
yv ; yv ' ' 7 r(ifen) -e fen g 

Letting hyper-parameters be m i n , # max and the domain be 

Y(9 mhl ,9 max ) = [y n \9 min < 9{y n ) < fl max } , 

the normalization term C(M) is obtained by taking an integral of g(9; k) with respect to 
9 over Y (9 min , 9 max )as follows: 



(kn) kn r 9m ^ 1 * _ {kn) kn 9_ 
r(fcn) • e kn X min 9 ~ T{kn) ■ e kn ° g J 



max 



Hence, for fixed k, we obtain a finite value of C(M) for Gamma distributions. 
2.3.2 Logistic Distributions 

The logistic distributions belong to the exponential family. The density function of x n for 
a logistic distribution with a parameter 9 is defined as 

i=l v ' 

The MLE of 9 is analytically obtained as 9(x n ) = n/(^2™ =1 log(l + e _:r4 )). Thus the joint 
density of x n is written as 

= H(x n \9(x n ))-g(9(x n );9), 
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where n/8(x n ) is distributed according to the Gamma distribution with a shape parameter 
n and a scale parameter 1/9. Thus g(9(x n );9) is written as 

9(9M;9) = ZL.( » ) . exp l "» 



r(n) ^(V 1 ) y [ 0(a* 

Fix #(x n ) = and let # (x n \9(x n )) = 5{9{x n ) = 9). Then we have 

„n— 1 

Letting i? be a parameter, we define the restricted domain as 

Y(R) = {y n \9(y n )<R}. (3) 

Then the normalization term C(A4) is obtained by taking an integral of g{9) with respect 
to as follows: 



n n-l n n-l 



C(A^) = — / 9d9 = —— R 2 

v ; r(n) ■ e" ./ T(n) ■ e n 



Thus we obtain the normalization term C(A4) that doesn't diverge. 

3 Re- normalized Maximum Likelihood 

We show how to compute the RNML code-length for a GMM. Let x ra = (xi, • ■ ■ , x n ), = 
(xa, ■ ■ ■ ,Xi m ) T (i — 1, ■ ■ • , n) be a given sequence where Xj is distributed according to a 
Gaussian distribution with mean /x e M m and variance-covariance matrix E e ^ mxm f or a 
some positive integer m with density: 

/(x; /i,E) = — -4— xexpj - ~(x - /i) T S" 1 (x - fj)\. 

Notice here that the normalization term in ([1]) diverges. Hirai and Yamanishi j5] derived 
a formula of the NML distribution by restricting the range of data so that the maximum 
likelihood lies in a bounded range specified by parameters. It is given as follows: 

f f n „ , ^def /(x";ft(* W ),S(x")) 

JnmlIx ; K, A mm J — 



C(R,X 



mm J 
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where 

C(R,X mi n) = [ /(y";/i(y ?l ),S(y"))dy", 

JY(R,X mia ) 

Y(R,x min ) ^ {y»| ||A(y n )ll 2 < R, A^L < Xj(y n ) 

(j = l,---,m), y n G X n }, (4) 

where R, X mm = (A^; n , ■ ■ • , A^ n ) are parameters, and Aj(y n ) is the j-th largest eigenvalue 
of S(y n ). The normalization term C(R, A min ) is expanded as follows [5]: 

c w = — >< 



m-+ir(f) ' V2e7 r m (2fi)' 

If we set the parameters: i?, A m i n to be bounded, then the normalization term is also 
bounded. 

Note here that the value of the normalization term depends on the choice of parame- 
ters: R, A m i n . Next we consider the optimization of the NML code-length with respect to 
the parameters: R, A m i n . That is, we choose the optimal parameters so that they achieve 
the minimum of the following NML code-length: — log /nml( x "; R, A mm ). The values of 
R-i Amin that make the NML code-length shortest can be considered as the maximum like- 
lihood (ML) estimates from x n . The terms including R, A m in in the NML code-length are 
given as: 

m 

^log*-^5>gA&. (5) 
Considering the range of parameters: @, the ML estimates of R, A m j n are given as follows: 

R(y n ) = IIMy n )ll 2 , 

Aml^) = A,(y")0' = l,---,m). 

We then introduce hyper parameters: 7 = (Ai, A2, R\, R2) and define the renormalized 
maximum likelihood (RNML) distribution by 

, f n \ /NML(x n ;7,i?(x n ), Amin(x")) 
J RNML ;7J — —— , 

where the normalization term is expanded as follows: 



C( 7 ) = / /NML(y n ;7, J R(y n ),Ami„(y n ))dy n , 



Y(l) = {y n \V(^R 1 )<V(^R(y^))<V(^R 2 ), 
Ai < A mm (y n ) < A 2 (j = 1, • • • , m), y n e 
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where V(r) = 2ir 2 r m / (mr(y)), which denotes the volume of the m- dimensional ball with 
radius r. 



The normalization term £(7) is rewritten as 



The terms including the hyper-parameters R\, R2, Ai, A2 in the RNML code-length are 
given by 

log log — - + m log log -2 , 

while those including the parameters i?, A^- n in the NML code- length are given by (jSJ) • 
Comparing them each other, we see that the dependency of the RNML code-length on 
the hyper parameters is lower than that of the NML code-length on the parameters by 
logarithmic order. 

We further give a new formula of the RNML code-length relative to a GMM. 
Theorem 2 The RNML code-length of x n relative to a GMM is expanded as follows: 

^RNML(x n ,^; 7 ,Er) 

= - log /(x", z n ; K, A(x n , z n ), S(x n , z n )) + logC^K, n) 
+ log C 2 (K, n) + log B(x n ,z n )+K log J(m, 7) , 



where 



hiH \-h,K=n k=l 

- e ii^n (£)**■ 



hiH \-h,K=n fc=l 



(7) 



Bfx n - yj 2 m + 1 .||A P (x",z")|r-|£ p (x",^ l 

p=l ^ 

J(m,7) = C( 7 ) = ^— J -log— -^log- 



hk\ mh k I 



/fere /i^ denotes the number of data belonging to the k-th cluster, and £l p , E p denote mean 
and the ML estimates of the variance- covariance matrix for the p-th cluster. 
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Note that straightforward computation of Ci(K,n) and C 2 (K,n) as in flSJ) and ([7j) re- 
quires 0(n K ) time. Below we give methods for efficient computation of Ci(K,n) and 
C2(K,n). As for the computation of Ci(K,n), Kontkanen and Myllymaki proved the fol- 
lowing theorem: 

Theorem 3 j6] Ci(K,n) satisfies the recursive formula: 

n 

Cx{K + 2, n) = d(K + 1, n) + —C^K, n). (9) 

K 

Hence Ci(K,n) is computed in time 0(n + K). 

As for the computation of C 2 (K,n), we newly give the following result: 
Theorem 4 C2{K,n) satisfies the following formula: 

C 2 (K+l,n) = nC ri (^) n (^-J 2 C 2 {K, ri )J{r 2 ), (10) 

ri+T2=n 

where J{r 2 ) is as in (TJ|). Hence C 2 (K,n) is computed in time 0(n 2 K). 

Combining all of the theorem as above, we see that the RNML code-length of x n relative 
to a GMM is computed in time 0(n 2 K). 

4 Experimental Results 

4.1 Comparison with AIC and BIC 

This section gives experimental results showing the validity of the RNML for GMMs. We 
generated a number of data sequences of size n according to the true GMM Ai of mixture 
size K. Each mixture component is a Gaussian distribution with mean and variance- 
covariance matrix (k = 1, • ■ ■ ,K). For each data sequence x n generated according 
to the true model Ai, we also generated their corresponding cluster indices z n using the 
EM algorithm [2], where Zi showed which cluster Xj came from (i — 1, ...,n). In our 
experiment, we repeated cluster generation using the EM algorithm 100 times by changing 
initial values of the algorithm. We compared the four criteria: RNML, NML, Akaike's 
Information Criterion (AIC) [1] and Bayesian Information Criterion (BIC) [11] for the 
choice of the number of clusters. We calculated RNML and NML according to the method 
proposed in the previous sections and [5]. We calculated AIC and BIC as follows: 
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AIC(x n , z n ; K) = -2 log /(x n , z n ; K, 0(x n , z n )) 

+m(m + 3)K + K, 
BIC(x n , z n ; K) = -2 log /(x n , z n ; K, §(x n , z n )) 
m(m + 3)K 



K 



log h k + K log 



n. 



k=l 



We measured their performance in terms of the identification probability P{K) and the 
benefit B(K) defined as follows: Letting K be the true number of clusters and K* be the 
one chosen using any criterion, 



P{K) = Prob{K* = K), 

( \K* - K\ 
B(K) = max <0,1- 



T 



where T is a given constant. The identification probability P{K) is the probability that 
the algorithm outputs the true number of clusters. The benefit is a score assigned to K so 
that if K = K* it takes the maximum value 1, and it decreases linearly to zero as \K* — K\ 
increases to T. The resulting benefit is calculated as the average of the benefits taken over 
all of random generation. We compared RNML, AIC, and BIC in terms of how fast the 
identification probability and the benefit converge as sample size n increases. 

Fig. [1] and Fig. |2] show graphs of accuracy rates and benefit vs data size for the case 
where the data dimension was m = 5 and the true number of clusters was K = 3. Here we 
set T = 2 in the calculation of B>(K) in (llip . 
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Figure 2: Benefit 



Figure 3: Data Size 
Required for Accuracy 
Rate/benefit to Exceed 0.8 
vs Parameter Values 

We see from these results that RNML achieved the highest identification probability, 
the highest benefit, and the fastest rate of convergence among all of the criteria: AIC, BIC, 
NML, and RNML. Specifically this was the case when the data size was not so large. This 
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implies that RNML was effective as a criterion for selecting an optimal number of clusters 
even when the data size was relatively small. 

Table [U |2l [31 and H] show the results on benefit obtained by varying the data dimension 
and the true number of clusters, where each numerical value in Tables indicates the least 
data size required for benefit to exceed 0.8. Here Inf shows that benefit did not exceed 0.8. 

We see from these results that for most of pairs of m and K, RNML achieves high 
benefit with smaller data size than AIC, BIC, and NML. This implies that the number of 
clusters estimated by RNML is within ±1 of the true one with sufficiently high probability. 

Table 1: Data Size Required for Benefit to Table 2: Data Size Required for Benefit to 
Exceed 0.8 (RNML) Exceed 0.8 (NML) 
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Table 3: Data Size Required for Benefit to Table 4: Data Size Required for Benefit to 
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4.2 Dependency of NML and RNML on Parameters 

Figj3] shows graphs of least data size required for accuracy rate and benefit to achieve 80% 
and 0.8 versus parameter values, respectively. We define parameter 9 as 9 = R2 /R\ — A2/A1 

in RNML, and 9 = R = A^ n in NML. We see that the RNML do not depend on 
parameter values more than NML. It implies that the dependency of RNML on parameter 
values is much less than that of NML. 



5 Conclusion 

We have proposed a general method for computing the NML code-length for the exponential 
family. We have developed it by generalizing the existing method for restricting the data 
domain so that the NML code-length does not diverge. We have specifically focused on 
Gaussian distributions and GMMs to propose a new efficient method for computing the 
RNML for them. We have developed it by extending Rissanen's renormalizing technique 
into multi-variate Gaussian distributions. We have applied this method to the clustering 
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issue, in which we have selected the optimal number of clusters on the basis of the RNML 
code-length. We have empirically demonstrated using artificial data that our method makes 
the estimate of the number of clusters converge significantly faster to the true one than 
AIC, BIC, and NML. 
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